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Exploring  Posterior  Distributions  Using  Markov  Chains 
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Abstract 

Several  Markov  chain-based  methods  are  available  for 
sampling  from  a posterior  distribution.  Two  important 
examples  are  the  Gibbs  sampler  and  the  Metropolis  al- 
gorithm. In  addition,  several  strategies  are  available 
for  constructing  hybrid  algorithms.  This  paper  outlines 
some  of  the  strategies  that  are  available,  and  discusses 
some  theoretical  and  practical  issues  in  the  use  of  these 
strategies.  In  addition,  some  preliminary  efforts  to  use 
Markov  chains  to  control  dynamic  graphics  for  exploring 
higher-dimensional  posterior  distributions  are  outlined. 

1 Introduction 

Suppose  we  are  given  a posterior  distribution  r on  a 
quantity  0 with  values  in  a space  E.  Usually  E will  be 
a subset  of  IRfc  and  it  will  have  a density  with  respect  to 
a measure  /x, 

it  (dp)  = it  (x)y(dx). 

For  simplicity,  it  will  be  used  to  denote  both  the  distribu- 
tion and  the  density.  We  may  be  interested  in  computing 
a particular  numerical  characteristic  of  it,  or  more  gener- 
ally in  developing  an  understanding  of  what  information 
it  contains  about  0. 

Several  methods  for  computing  characteristics  of  pos- 
terior distributions  are  now  available.  These  include 
asymptotic  approximations,  numerical  integration,  and 
sampling  or  Monte  Carlo  methods.  Sampling  meth- 
ods for  examining  posterior  distributions  provide  ways 
of  generating  samples  with  the  property  that  the  em- 
pirical distribution  of  the  sample,  or  an  appropriately 
weighted  empirical  distribution,  approximate  the  poste- 
rior distribution.  Using  such  samples,  it  is  easy  to  esti- 
mate characteristics  such  as  the  mean  or  standard  devi- 
ation of  a function  of  0.  Marginal  distributions  can  be 
estimated  using  smoothing  or,  in  some  cases,  variance 
reduction  methods.  In  addition,  for  equally  weighted 
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samples  methods  for  viewing  point  clouds,  such  as  ro- 
tating plots  and  Grand  Tours,  can  be  used  to  examine 
the  joint  uncertainty  about  three  or  more  components  or 
features  of  0. 

A number  of  different  sampling  methods  are  available. 
In  rare  cases  it  is  possible  to  sample  directly  from  the 
posterior  distribution  and  thus  obtain  an  i.i.d.  sample 
from  it.  In  most  problems  this  is  not  possible.  Either 
the  sample  has  to  te  dependent,  or  the  distribution  used 
to  generate  the  sample  has  to  be  different  from  it.  A 
method  that  uses  independent  samples  from  a distribu- 
tion similar  to  it  is  importance  sampling.  The  sample  is 
then  weighted  to  make  up  for  the  difference  between  it 
and  the  distribution  used  to  generate  the  sample.  Over 
the  past  decade,  most  work  on  sampling  methods  for 
exploring  posterior  distributions  has  centered  on  impor- 
tance sampling  (Geweke,  1989;  Stewart,  1979;  van  Dijk 
ei  al.,  1978;  Zellner  and  Rossi,  1984;  among  others).  An 
alternative  approach  that  avoids  the  need  for  weights  is 
to  use  a dependent  sample,  such  as  the  sample  path  of 
a Markov  chain. 


2 Markov  Chain  Methods 

Markov  chain  methods  generate  a sample  path  from 
a Markov  chain  that  has  it  as  its  stationary  distribu- 
tion. Recent  work  of  Gelfand  and  Smith  (1990)  on 
the  Gibbs  sampling  algorithm  has  renewed  interest  in 
Markov  chain  methods  for  exploring  posterior  distribu- 
tions. Gelfand  and  Smith  extend  the  Gibbs  sampling 
algorithm  of  Geman  and  Geman  (1984),  originally  de- 
veloped for  Bayesian  image  reconstruction,  to  continu- 
ous distributions  and  show  how  the  algorithm  can  be 
used  in  a wide  variety  of  problems. 

Markov  chain  methods  have  a long  history  in  Mathe- 
matical physics  dating  back  to  the  algorithm  of  Metropo- 
lis et  al  (1953).  The  Metropolis  algorithm  is  in  fact  a 
general  class  of  algorithms  that  includes  versions  of  the 
discrete  Gibbs  sampler  as  special  cases. 
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2.1  The  Metropolis  Algorithm 


2.1.1  Random  Walk  Chains 


Metropolis  ei  al.  (1953)  originally  proposed  the  al- 
gorithm now  known  as  the  Metropolis  algorithm  as  a 
method  of  sampling  from  the  equilibrium  distribution 
of  an  interacting  particle  system.  The  algorithm,  which 
is  described  in  Hammersley  and  Handscomb  (1964,  Sec- 
tion 9.3)  and  Ripley  (1987,  Section  4.7),  was  extended  by 
Hastings  (1970)  and  explored  further  by  Peskun  (1973). 

To  define  Hastings  version  of  the  algorithm,  let  Q be 
a Markov  transition  kernel  with 


Q(Xjdy)  = q{x,y)fi(dy). 


Let  E+  = {x  : ;r(x)  > 0},  and  assume  Q(x,  E+)  = 1 for 
x £ E+.  Then  define 


a(x,  y)  = min 


f *(y)q(y,x)  .1 
l *(*)$(*,  2/)’  J 


for  it (x)q(x,y)  > 0.  Otherwise,  a(x,y)  = 1.  If  the 
Markov  chain  is  currently  at  X„  = x,  then  the  algo- 
rithm generates  a candidate  Y = y for  the  next  state 
from  Q(x,  •).  With  probability  a(x,y)  this  candidate  is 
accepted  and  the  chain  moves  to  Xn+i  = y.  Other- 
wise, the  candidate  is  rejected  and  the  chain  remains  at 
•Xn+1  = x. 

Since 


* (x)q{x,y)a(x,y)  = it  (y)q(y,x)a(y,x), 


a Metropolis  chain  with  initial  distribution  it  is  re- 
versible. Therefore  i r is  an  invariant  distribution  for  the 
chain.  Some  additional  conditions  on  it  and  Q are  needed 
to  insure  that  it  is  also  a limiting  distribution;  these  con- 
ditions are  discussed  in  Section  3 below.  Since  the  accep- 
tance probability  only  depends  on  it  through  the  ratio 
it(y)/it(x),  the  density  it  only  needs  to  be  specified  up 
to  a constant  of  proportionality. 

If  q(x,y)  = q(y,x),  t.e.  q is  symmetric,  then  the  ac- 
ceptance probability  o(x,  y)  simplifies  to 


a(x,  y)  = min 


This  is  the  original  form  of  the  algorithm  proposed  by 
Metropolis  ei  ai.  (1953).  Other  forms  of  the  rejection 
probability  are  possible,  but  the  form  given  here  can 
be  shown  to  be  optimal  within  a wide  class  of  possible 
alternative  forms  (Peskun,  1973). 

The  Metropolis  algorithm  is  actually  a class  of  algo- 
rithms. Each  different  choice  of  the  kernel  Q for  gen- 
erating candidate  steps  produces  a different  version  of 
the  algorithm.  Several  classes  of  kernels  appear  to  be 
particularly  useful  for  examining  posterior  distributions. 


For  E = VtLk  and  / a density  on  E,  set  Y = x + Z , with 
Z drawn  independently  from  /.  Then 

q(x,y)  = f(y-x). 

Thus  the  kernel  Q driving  the  Metrolopis  chain  is  a ran- 
dom walk.  Natural  choices  of  / are  normal,  uniform,  and 
t distributions.  Split-t  distributions  (Geweke,  1989)  may 
also  be  useful.  The  scale  matrix  for  / can  be  taken  as  a 
constant  c times  the  inverse  information  at  the  posterior 
mode.  Good  choices  for  the  step  size  constant  c are  still 
an  open  problem,  but  c = 1 and  c = 1/2  seem  to  work 
reasonably  well  in  a number  of  examples. 

If  / is  symmetric  about  the  origin,  i.e.  if  /(z)  = 
/(— z),  then  q is  symmetric  and  the  simpler  form  of  the 
acceptance  probability  a(x,y)  can  be  used. 

2.1.2  Independence  Chains 

Suppose  / is  a density  on  E,  and  we  generate  candidates 
7 independently  from  the  single  density  /.  Then 

q(x,y ) = /(y)- 

The  chain  of  candidates  driving  this  Metropolis  chain  is 
an  i.i.d.  sequence  from  the  density  /.  The  acceptance 
probability  for  an  independence  chain  can  be  written  as 


for  u>(x)  = 7r(x)//(x).  The  function  w is  the  weight 
function  that  would  be  used  in  importance  sampling 
when  the  sample  is  generated  from  the  density  /. 

There  are  a number  of  similarities  between  an  indepen- 
dence chain  and  the  corresponding  importance  sampling 
process.  While  an  independence  chain  does  not  require 
explicit  use  of  the  weights,  it  will  rarely  accept  candi- 
dates with  low  weights.  On  the  other  hand,  a candidate 
with  high  weight  will  almost  always  be  accepted.  Fur- 
thermore, when  the  chain  reaches  a point  x with  high 
weight  u>(x),  it  will  usually  remain  there  for  several  it- 
erations, thus  building  up  weight  on  x within  the  sam- 
ple path  by  repetition.  Another  similarity  to  importance 
sampling  is  that  the  sample  sequence  is  closer  to  an  i.i.d. 
sequence  from  it  the  closer  the  weight  function  w is  to  a 
constant. 

Because  of  these  similarities  to  importance  sampling, 
it  is  reasonable  to  conjecture  that  guidelines  developed 
for  choosing  importance  sampling  densities  also  apply 
to  choosing  densities  for  driving  independence  chains. 
In  particular,  it  is  advisable  to  choose  a density  with 
thicker  tails  than  it  and  thus  a bounded  weight  function 
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w.  Families  like  the  split-*  that  produce  good  impor- 
tance sampling  densities  are  likely  to  be  good  choices  for 
independence  chains. 

2.1.3  Rejection  Sampling  Chains 

An  interesting  special  case  of  an  independence  chain  oc- 
curs when  the  density  / is  sampled  using  rejection  sam- 
pling. In  attempting  to  use  rejection  sampling  to  sample 
directly  from  it,  we  use  a density  h and  a constant  c such 
that,  hopefully,  tt(x)  < c/i(x).  If  we  repeat  the  process  of 
sampling  Z from  h and  then  U uniformly  from  [0,  ch(Z)], 
until  U < it(Z),  then  the  final  value  of  Z has  density 

/(x)  oc  x(x)  A ch(x). 

If  we  do  indeed  have  tt(x)  < c/i(x),  then  / is  proportional 
to  it  and  we  obtain  an  i.i.d.  sample  from  i r.  But  it  is 
very  difficult  to  insure  that  c is  large  enough  for  ch  to 
dominate  7r  without  choosing  c excessively  large,  leading 
to  an  inefficient  algorithm  with  many  rejections.  And 
even  then  without  extensive  analysis  of  the  tails  of  h 
and  it  we  cannot  be  certain  that  ch  does  dominate  it. 

Fortunately,  using  this  rejection  scheme  to  drive  an  in- 
dependence Metropolis  chain  provides  a simple  remedy. 
If  we  do  have  tt(x)  < ch(x)  for  all  x,  then  the  weight 
function  w is  a constant,  no  candidates  are  rejected,  and 
the  rejection  process  produces  an  i.i.d.  sequence  from  it 
that  is  simply  passed  through  the  Metropolis  algorithm 
unchanged.  But  if  ch  does  not  dominate  7r  for  some  x, 
then,  when  the  chain  reaches  such  an  x,  the  Metropo- 
lis algorithm  will  occasionally  reject  candidate  steps  in 
order  to  build  up  mass  on  this  x to  make  up  for  the 
deficiency  in  the  envelope  ch.  This  introduces  some  de- 
pendence, but  insures  that  the  equilibrium  distribution 
of  the  sample  path  is  it  even  if  the  envelope  is  deficient. 

2.2  Combining  Strategies 

The  Gibbs  sampler  and  the  Metropolis  algorithms  de- 
scribed above  provide  a number  of  Markov  chain  strate- 
gies. In  addition  to  choosing  any  one  of  these  strategies 
and  using  it  in  its  pure  form,  it  is  possible  to  form  hybrid 
strategies. 

Suppose  Pi,...,Pm  are  Markov  kernels  with  invari- 
ant distribution  7r.  Two  simple  ways  of  combining  these 
kernels  is  as  a mixture  or  a cycle.  In  a mixture,  proba- 
bilities a i, . . , , am  are  specified,  and  at  each  step  one  of 
the  kernels  is  selected  according  to  these  probabilities. 
In  a cycle,  each  kernel  is  used  in  turn,  and  when  the  last 
one  is  used  the  cycle  is  restarted. 

Both  strategies  can  be  used  in  several  ways.  For  ex- 
ample, a Gibbs  sampler  can  be  combined  with  occasional 


steps  from  an  independence  chain  in  a mixture  or  a cycle 
to  “restart”  the  Gibbs  sampler  and  thus  reduce  correla- 
tions while  preserving  the  equilibrium  distribution.  As 
another  example,  suppose  6 can  be  split  into  two  compo- 
nents (0i, #2) i and  direct  sampling  from  0i|02  is  possible 
but  direct  sampling  from  02|0i  is  not  possible.  Such  a 
situation  is  considered  by  Zeger  and  Karim  (1991).  Then 
“Gibbs  steps”  for  0i  |02  can  be  combined  with  Metropolis 
steps  for  02|0i  in  a mixture  or  a cycle. 

3 Some  Theoretical  Results 

Whatever  approach  is  used  to  produce  a Markov  chain 
with  invariant  distribution  it,  before  the  chain  can  be 
used  with  confidence  to  generate  samples  for  examining 
it  certain  theoretical  questions  need  to  be  addressed.  An- 
swers to  some  of  these  questions  can  be  obtained  using 
some  recent  developments  in  general  state  space  Markov 
chain  theory  as  described,  for  example,  in  Nummelin 
(1984).  This  section  outlines  this  approach.  A more 
complete  discussion  is  given  in  Tierney  (1991). 

3.1  Convergence 

The  first  question  to  be  addressed  is  whether  the  invari- 
ant distribution  it  is  also  the  equilibrium  distribution  for 
the  chain,  t.e.  whether  the  distribution  of  the  chain  af- 
ter n iterations  converges  to  it.  In  discrete  state  space 
Markov  chain  theory,  two  conditions  are  needed:  irre- 
ducibility  and  aperiodicity.  The  same  is  true  in  general 
state  space  theory.  Periodicity  for  general  state  spaces 
can  be  defined  in  much  the  same  way  as  for  discrete 
spaces.  The  concept  of  irreducibility  is  a little  more  com- 
plicated, since  individual  states  are  usually  not  hit  with 
positive  probability.  It  is  therefore  necessary  to  speak  of 
irreducibility  with  respect  to  a measure.  In  the  present 
context,  a natural  choice  for  this  measure  is  it  itself.  We 
will  therefore  say  that  a Markov  chain  is  7r-irreducible 
if  for  every  set  A with  it(A)  > 0 the  probability  of  the 
chain  ever  entering  A is  positive  for  every  starting  point 
x of  the  chain. 

Irreducibility  and  aperiodicity  need  to  be  verified  for 
each  Markov  chain.  Some  useful  sufficient  conditions  are 
available  for  certain  Metropolis  chains.  For  example,  a 
random  walk  chain  is  jr-irreducible  and  aperiodic  if  the 
increment  density  is  positive  on  a neighborhood  of  the 
origin  and  the  density  it  is  positive  on  all  of  IR*.  An 
independence  chain  is  7r-irreducible  and  aperiodic  if  the 
candidate  generation  density  / is  positive  whenever  the 
density  7r  is  positive. 

If  a chain  with  invariant  distribution  x-irreducible  and 
aperiodic,  then  it  can  be  shown  that  the  chain  must  be 
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positive  recurrent  and  that  for  7r-almost  all  x, 

II  Pn(*.  •)-*(•)  ll-o 

where  ||  • ||  denotes  the  total  variation  distance  and  Pn 
is  the  distribution  after  n steps  of  the  chain  started  at 
x. 

If  the  chain  is  Harris  recurrent,  then  this  convergence 
occurs  for  all  x.  The  definition  of  Harris  recurrence  is 
somewhat  technical,  but  a simple  sufficient  condition  is 
available  that  is  satisfied  by  all  7r-irreducible  Metropolis 
chains  and  essentially  all  ir-irreducible  Gibbs  samplers. 

A ^"-irreducible  aperiodic  Markov  chain  with  invari- 
ant distribution  n is  called  ergodic  if  it  is  aperiodic  and 
positive  Harris  recurrent. 

3.2  Rates  of  Convergence 

Once  we  know  that  the  distribution  of  a chain  converges 
to  7 r,  the  next  question  is  to  determine  the  rate  of  conver- 
gence. The  theory  presented  in  Nummelin  (1984)  pro- 
vides several  classifications  for  rates  of  convergence  of 
ergodic  chains: 

Degree  2:  If  a chain  is  ergodic  of  degree  2,  then 
n||  P"(s,  .)-*(■)  ||- 0 

for  7r-almost  all  x. 

Geometric:  An  ergodic  chain  is  geometrically  ergodic 
if  ||  Pn(x,  •)  — 7r(‘)  ||<  M(x)rn  for  some  r < 1 and 
some  function  M with  f Mdx  < oo.  • 

Uniform:  An  ergodic  chain  is  called  uniformly  ergodic 
if  ||  Pn(x,  •)  — tt(')  ||<  Mrn  for  some  r < 1 and  some 
constant  M. 

Uniform  ergodicity  is  the  strongest  of  these  forms  of 
convergence  and  it  is  the  easiest  form  to  work  with.  A 
necessary  and  sufficient  condition  for  a chain  with  kernel 
P to  be  uniformly  ergodic  is  that  there  exist  a probabil- 
ity u , a constant  /?  > 0 and  an  integer  n > 1 such  that 
i/(A)  < Pn(x,A ) for  all  A and  x.  Using  this  condition,  it 
is  possible  to  derive  a variety  of  sufficient  conditions  for 
uniform  ergodicity  For  example,  if  fi(E)  < oo  and  the 
densities  q and  ir  are  bounded  and  bounded  away  from 
zero,  then  the  corresponding  Metropolis  kernel  is  uni- 
formly ergodic.  As  another  example,  an  independence 
Metropolis  kernel  is  uniformly  ergodic  if  the  weight  func- 
tion w(x)  is  bounded. 

This  condition  can  also  be  used  to  derive  coi  litions 
for  uniform  ergodicity  of  hybrid  kernels  in  terms  c (’con- 
ditions on  the  component  kernels.  For  mixtures  th.  con- 
dition is  particularly  simple:  if  P is  uniformly  ergodic, 


then  any  mixture  using  P with  positive  probability  is 
uniformly  ergodic.  For  cycles  a slightly  more  compli- 
cated condition  appears  to  be  needed:  if  P is  used  in 
a cycle  and  there  exists  a probability  v and  a constant 
/?  > 0 such  that  u(A)  < P(x,  A)  for  all  A and  x,  then  the 
cycle  is  uniformly  ergodic.  This  condition  is  satisfied  if  P 
is  an  independence  kernel  with  a bounded  weight  func- 
tion. Combining  such  a kernel  in  a mixture  or  a cycle 
with  any  other  kernel,  such  as  a Gibbs  kernel,  therefore 
insures  that  the  hybrid  chain  is  uniformly  ergodic.  This 
provides  theoretical  support  for  using  occasional  inde- 
pendence “restart”  steps  together  with  a Gibbs  sampler 
to  improve  the  properties  of  the  sampler. 

3.3  Limiting  Behavior  of  Averages 

In  Markov  chain  methods,  sample  path  averages  are  used 
to  estimate  expectations  under  the  distribution  tt.  A law 
of  large  numbers  and  a central  limit  theorem  insure  that 
these  estimates  converge  at  reasonable  rates.  The  law 
of  large  number , follows  from  the  ergodic  theorem  and 
needs  no  conditions  other  than  existence  of  the  expecta- 
tion under  7r: 

Law  of  Large  Numbers.  If  P is  ergodic  with  invari- 
ant distribution  ir,  and  n\f  \ < oo,  then  for  any  initial 
distribution 

In  = — ’i,/=  / /(*)*(&) 

n i= i J 

almost  surely. 

A central  limit  theorem  does  appear  to  require  some 
conditions  on  the  rate  of  convergence  of  the  chain: 

Central  Limit  Theorem.  If  P is  ergodic  of  degree 
2 with  7 xP  = 7T,  and  f is  bounded,  then  for  any  initial 
distribution  the  distribution  of 

Vn(fn  - tt/) 

converges  weakly  to  a normal  distribution  with  mean  zero 
and  variance  v2(f). 

Weaker  but  more  complicated  sufficient  conditions  are 
available.  Expressions  for  the  asymptotic  variance  c2(/) 
are  available  for  finite  E (Peskun,  1973;  Kemeny  and 
Snell,  1976).  Other  expressions  involving  certain  hitting 
times  are  available  for  general  state  spaces  (Nummelin, 
1984).  These  expressions  do  not  appear  to  be  useful  for 
computing  the  asymptotic  variance. 
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4 Using  a Markov  Chain 

Once  a Markov  chain  strategy  with  satisfactory  theoret- 
ical properties  has  been  selected,  it  can  be  used  to  es- 
timate numerical  characteristics  or  to  provide  graphical 
views  of  features  of  the  posterior  distribution. 

4.1  Numerical  Uses 

Using  Markov  chains  for  calculating  numerical  character- 
istics of  a posterior  distribution  is  in  principle  straight 
forward:  expectations  with  respect  to  tt  can  be  approx- 
imated by  sample  path  averages.  There  are,  however, 
a number  of  issues  that  need  to  be  considered  before 
running  a chain. 

4.1.1  Choosing  a Sampling  Plan 

The  first  issue  concerns  the  choice  of  a sampling  plan. 
There  are  two  < xtreme  approaches.  Several  authors  have 
proposed  that  Markov  chains  should  be  used  to  generate 
n independent  realizations  from  the  posterior  by  using 
n separate  runs,  each  of  length  m,  and  retaining  the  fi- 
nal states  from  each  chain.  The  run  length  m is  to  be 
chosen  large  enough  to  insure  that  the  chain  has  reached 
equilibrium.  An  alternate  approach  is  to  use  a single 
long  run,  or  perhaps  a small  number  of  long  runs.  Ex- 
perience and  theoretical  assessments  in  the  simulation 
literature  appear  to  favor  the  use  of  long  runs  (Bratley 
ei  al,  1987,  Section  3.1.1;  Kelton  and  Law,  1984).  The 
major  drawback  of  using  short  runs  is  that  it  is  virtually 
impossible  to  tell  when  a run  is  long  enough  based  on 
such  runs.  Even  using  long  runs,  determining  how  much 
of  the  initial  series  is  affected  by  the  starting  state  is  very 
difficult,  but  some  literature  on  the  subject  is  available 
(Ripley,  1987,  Section  6.1).  A second  drawback  of  short 
runs  is  that  it  makes  inefficient  use  of  the  data:  only  n 
out  of  a total  of  nm  data  points  are  used.  With  a single 
run  of  length  nm  it  is  possible  to  use  all  the  data,  after 
possibly  discarding  a small  initial  fraction. 

A complication  that  does  arise  from  the  dependence 
in  using  a single  series  is  that  variances  of  estimates  are 
hard  r to  assess.  Again  the  simulation  literature  offers 
several  alternatives,  such  as  the  use  of  batch  means  and 
timeser  .'’.  analysis  (Bratley  et  al.,  1987,  Chapter  3;  Rip- 
ley, 19l.\  Chapter  6).  For  some  purposes  it  may  never- 
theless be  useful  to  have  an  approximate  independent 
sample  from  the  posterior.  Using  long  runs  this  can 
be  achieved  by  retaining  every  r-th  point  of  a sample 
path.  The  number  r of  points  to  skip  in  order  to  pro- 
duce approximate  independence  can  usually  be  chosen 
much  smaller  than  the  number  m of  steps  needed  to 
reach  approximate  equilibrium,  since  small  amounts  of 


correlation  are  usually  much  less  serious  than  biases  in 
estimates  of  means. 

4.1.2  Determining  the  Run  Length 

Another  consideration  is  to  determine  the  total  sample 
size  or  run  length  required  for  accurate  estimates.  For 
an  i.i.d.  sample  of  size  n,  the  standard  deviation  of  the 
sample  mean  of  a function  f(6)  is  <r/\Jn,  whare  a is  the 
posterior  standard  deviation  of  f(Q).  If  a preliminary 
estimate  of  cr  is  available,  perhaps  from  an  asymptotic 
analycis,  then  this  can  be  used  to  estimate  the  sample 
size  that  would  be  required  in  i.i.d.  sampling.  In  de- 
pendent sampling,  observations  are  generally  positively 
correlated  and  a larger  simple  size  will  be  required.  If  the 
series  is  modeled  as  a first  order  autoregressive  process, 
then  the  standard  deviation  of  the  sample  mean  is 

<r  /I  + p 

y/n\  1 - p 

where  again  <r  is  the  posterior  standard  deviation  of  f(6) 
and  p is  the  autocorrelation  of  the  series  f{Xn).  A rough 
estimate  of  p can  thus  be  used  to  adjust  the  sample  size 
for  dependence  in  the  series. 

Instead  of  determining  a fixed  sample  size  in  advance, 
it  is  also  possible  to  use  sequential  or  batch  sequential 
rules  for  determining  when  to  stop  sampling.  Since  prior 
information  on  the  values  of  the  posterior  mean  and 
standard  deviation  is  often  available  form  initial  anal- 
yses, Bayesian  sequential  methods  are  a natural  choice. 
Batching  can  be  used  to  insure  that  an  assumption  of 
normality  for  batched  means  is  reasonable. 

One  sequential  approach  that  should  be  avoided  is  to 
plot  successive  sample  means  and  stop  sampling  when 
the  means  appear  to  have  converged.  Since  sample 
means  change  by  increments  on  the  order  of  0(n-1)  but 
errors  are  of  order  0(n-1/2),  this  approach  will  produce 
sample  sizes  that  are  too  small.  The  presence  of  positive 
correlations  in  Markov  chain  series  makes  these  series 
appear  to  have  converged  even  earlier,  even  though  the 
correlations  imply  that  errors  are  larger  and  thus  larger 
sample  sizes  are  required  than  with  i.i.d.  sampling. 

4.1.3  Numerical  Issues 

Some  consideration  of  numerical  stability  is  needed  in 
using  any  sampling  based  method.  Expressions  used  to 
evaluate  log  posterior  densities  obtained  by  translating 
mathematical  formulas  into  a computer  language  are  of- 
ten reasonably  stable  near  the  posterior  mode  but  not 
far  away  from  the  posterior  mode.  This  can  lead  to  over- 
flows or,  on  IEEE  hardware,  results  that  are  NAN’s  or 
INF’s.  One  way  to  avoid  these  problems  is  to  carefully 
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study  tht  formulas  for  evaluating  the  log  posterior  den- 
sity and  modify  them  to  be  numerically  stable  even  for 
extreme  parameter  values.  The  effort  required  to  do  this 
can  be  considerable.  An  expedient  alternative  that  is  of- 
ten effective  is  to  truncate  the  parameter  space  to  a rea- 
sonable range  that  contains  essentially  all  the  posterior 
probability  and  for  which  the  posterior  density  formula 
is  numerically  stable.  This  truncation  also  often  insures 
that  a Markov  chain  used  to  sample  from  tt  is  uniformly 
ergodic  and  thus  improves  the  beha.ior  of  the  Markov 
chain  estimates. 

The  need  to  allow  truncation  is  an  important  consider- 
ation in  developing  software  for  implementing  sampling 
based  methods.  Subroutines  must  allow  for  user  sup- 
plied range  test  functions  or  allow  the  results  returned 
by  the  log  posterior  subroutine  to  indicate  a parameter 
that  is  outside  of  the  range. 

A numerical  issue  that  is  unique  to  Markov  chain 
methods  is  the  possibility  that  rounding  may  introduce 
absorbing  states.  If  this  happens,  results  obtained  from  a 
Markov  chain  method  may  be  meaningless.  Again  trun- 
cation away  from  areas  of  the  state  space  where  such 
rounding  may  occur  can  be  helpful. 

4.1.4  Variance  Reduction 

As  with  any  simulation  method,  variance  reduction  tech- 
niques can  often  significantly  reduce  the  sample  sizes  re- 
quired for  accurate  estimates.  Standard  variance  reduc- 
tion methods  such  ns  importance  sampling,  antithetic 
variates,  conditioning,  and  control  variaies  (Bratley  ei 
ai,  1987,  Chapter  2;  Ripley,  1987,  Chapter  5)  can  be 
used  with  any  Markov  chain  method. 

Importance  sampling  can  be  used  as  a variance  reduc- 
tion method  by  using  a Markov  chain  with  equilibrium 
distribution  / instead  of  n and  then  weighting  sample 
results  with  appropriate  importance  weights.  Condition- 
ing is  often  useful  in  Gibbs  samplers,  since  the  assump- 
tions required  for  the  Gibbs  sampler  imply  that  condi- 
tional means  or  densities  of  one  parameter  given  the  rest 
are  usually  available.  Gelfand  and  Smith  (1990)  refer  to 
this  use  of  conditioning  as  Rao-Blackwelln-ation. 

Antithetic  variation  can  be  introduced  into  a Markov 
chain  method  by  using  a MeLopolis  step  in  which  a can- 
didate step  is  obtained  by  reflecting  the  current  state  of 
the  chain  through  a point.  If  the  posterior  density  is  ap- 
proximately symmetric  about  this  point,  then  the  sam- 
ple will  be  also,  and  the  resulting  negative  correlations 
will  reduce  variances  of  estimates  of  linear  functions  of 
0.  This  technique  can  also  be  used  to  take  advantage 
of  approximate  axial  symmetries  in  a posterior  distribu- 
tion. 

One  way  to  introduce  control  variates  into  a Markov 


chain  method  is  to  use  the  sample  path  with  importance 
weights  to  calculate  estimates  of  normal  approximations 
and  to  correct  for  the  errors  in  these  estimates. 

4.1.5  Monitoring  Sampler  Performance 

In  using  Markov  chain  methods,  it  is  important  to  mon- 
itor the  performance  of  the  samplers  to  insure  that  they 
are  not  exhibiting  any  unusual  behavior.  Gelfand  and 
Smith  (1990)  propose  the  use  of  quantile  plots  to  moni- 
tor performance.  Monitoring  sample  paths  of  estimates 
is  also  useful  for  this  purpose,  as  is  monitoring  autocor- 
relations of  the  parameters.  Adaptive  time  series  models 
may  also  be  useful  for  determining  whether  a series  ex- 
hibits any  unusual  features. 

For  Metropolis  chains  it  is  also  important  to  keep  track 
of  the  the  number  of  candidates  that  are  rejected.  For 
an  independence  chain,  the  proportion  of  rejections  can 
be  related  to  the  total  variation  distance  between  the 
posterior  density  tt  and  the  candidate  generation  density 
/. 

By  monitoring  the  performance  of  a sampler,  in  par- 
ticular in  the  early  stages,  it  is  possible  to  experiment 
with  different  setting  for  sampler  parameters  to  obtain 
samplers  that  are  efficient  for  a particular  problem.  More 
work  is  needed  to  find  good  strategies  for  making  such 
parameter  adjustments. 

4.2  Graphical  Uses 

Numerical  summaries,  such  as  posterior  means,  standard 
deviations,  marginal  densities,  and  correlations,  provide 
insight  into  the  uncertainty  about  one  or  perhaps  two 
features  of  0 at  a time.  For  understanding  uncertainty 
in  higher  dimensions  graphical  methods  may  be  more 
useful  than  numerical  summaries. 

4.2.1  Plotting  Samples 

For  three-dimensional  quantities,  one  useful  graphical 
method  available  on  microcomputers  and  workstations 
with  bitmapped  displays  is  a rotatable  three-dimensional 
scatterplot.  By  selecting  every  r-th  entry  in  a Markov 
chain  sample  path  we  can  obtain  an  approximate  i.i.d. 
sample  from  the  posterior  distribution  and  display  this 
sample  in  a rotatable  scatterplot.  Three-dimensional 
structures  will  readily  become  apparent  as  the  point 
cloud  of  the  sample  is  rotated. 

Rotatable  scatterplots  are  only  useful  for  examining 
three  dimensions  at  a time.  A method  that  may  be  use- 
ful for  higher  dimensions  is  the  Grand  Tour.  Again  an 
approximate  i.i.d.  sample  can  be  selected  and  displayed 
in  a Grand  Tour.  Implementations  of  the  Grand  Tour 


Figure  1:  Posterior  mean  of  a response  function. 

are  only  now  becoming  widely  available,  so  extensive  ex- 
perience with  this  method  is  not  yet  available.  Early 
results  suggest  that  this  method  is  reasonably  effective 
for  detecting  structures  in  four  to  six  dimensions. 

4.2.2  Controlling  Animations 

If  8 is  more  than  five-  or  six-dimensional,  then  it  may  be 
difficult  enough  to  understand  8 itself,  much  less  uncer- 
tainty about  8.  If  a graphical  view  of  8 is  available  that 
is  meaningful  for  particular  values  of  8,  then  one  way  of 
developing  an  understanding  of  the  uncertainty  about  0 
is  to  look  at  an  animated  version  of  the  graph  in  which 
0 is  moved  through  a variety  of  values  that  are  plausible 
under  the  posterior  distribution. 

As  an  example,  suppose  we  have  a smooth  response 
function  8 of  a real  variable  x in  some  interval  I that  is 
measured  with  error.  Thus  we  obtain  measurements  of 
the  form 

Y = 0(x)  + e. 

Our  prior  opinion  on  the  function  8 suggests  that  this 
function  is  smooth,  but  does  not  suggest  any  particular 
parametric  structure. 

Several  approaches  are  available  for  specifying  such  a 
prior  distribution.  Most  involve  choosing  a prior  on  co- 
efficients in  some  representation,  such  as  a power  series 
or  spline.  The  coefficients  of  these  representations  are 
not  likely  to  be  particularly  meaningful.  But  a plot  of 
the  response  function  6 over  the  interval  I is  readily  un- 
derstood. Figure  1 shows  a plot  of  the  posterior  mean  of 


Figure  2:  A second  response  function  supported  by  the 
posterior  distribution. 

8 for  a particular  example.  This  mean  exhibits  a number 
of  features,  such  as  a pronounced  global  minimum  and 
a secondary  local  minimum.  Are  these  features  really 
present  in  8 or  are  they  merely  artifacts  of  the  poste- 
rior mean?  One  way  to  answer  this  question  is  to  look 
at  other  functions  8 that  are  supported  by  the  posterior 
distribution.  This  can  be  done  by  running  an  animation 
that  shows  graphs  of  different  values  of  6. 

To  provide  a good  understanding  of  the  posterior  dis- 
tribution, an  animation  needs  to  visit  all  areas  supported 
by  the  posterior.  In  addition,  to  allow  the  user  to  keep 
track  of  the  changes  in  8 as  it  moves  through  the  poste- 
rior distribution,  the  animation  has  to  move  smoothly. 
These  objectives  can  be  achieved  using  a random  walk- 
driven  Metropolis  chain  with  the  posterior  distribution 
as  its  equilibrium  distribution.  Using  the  posterior  as 
the  equilibrium  insures  that  the  chain  does  eventually 
approach  all  possible  values  of  8 but  spends  most  of  its 
time  near  values  that  are  better  supported  by  the  pos- 
terior distribution.  The  correlation  in  the  random  walk 
insures  that  the  chain  moves  in  small  steps,  thus  provid- 
ing the  visual  continuity  that  is  necessary  for  an  effective 
animation.  Thus  the  correlations  in  the  Metropolis  chain 
that  are  a nuisanct  for  numerical  computations  are  in 
fact  an  advantage  for  this  graphical  application.  Conti- 
nuity can  be  further  enhanced  by  interpolating  between 
steps  of  the  random  walk. 

Figure  2 shows  another  view  of  the  animation.  View- 
ing the  animation  for  this  particular  example  for  a few 
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minutes  quickly  reveals  that  the  global  minimum  is  quite 
well  defined  but  the  shape  of  the  left  half  of  the  curve  is 
very  uncertain. 

A useful  enhancement  for  this  animation  is  the  bar 
shown  at  the  left  of  the  two  plots.  The  solid  part  of  the 
bar  represents  the  probability  content  in  the  posterior 
at  or  below  the  level  of  the  current  6,  computed  using  a 
X2  approximation.  This  gives  a quick  indication  of  how 
plausible  the  current  view  is. 

Many  variations  on  this  animation  are  possible.  For 
example,  using  the  posterior  distribution  as  the  equilib- 
rium of  the  driving  Markov  chain  is  a reasonable  starting 
point  but  is  not  essential.  At  times  it  may  be  useful  to 
force  the  chain  to  concentrate  its  motion  closer  to  the 
mode,  or  to  move  farther  away  from  the  mode  and  pos- 
sibly find  interesting  features  that  are  farther  away.  This 
can  be  accomplished  by  using  a Markov  chain  with  an 
equilibrium  density  that  is  a power  of  the  posterior  den- 
sity - by  “cooling”  or  “heating”  the  posterior  distribu- 
tion in  the  terminology  of  simulated  annealing. 

Much  additional  work  is  needed  to  explore  ways  of 
merging  numerical  methods  such  as  the  ones  described  in 
this  paper  with  new  computing  hardware  that  is  now  be- 
coming more  widely  available.  The  animation  described 
here  is  a first  step  in  that  direction. 
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